We will examine the file data/college_data.csv, which contains more than 160 variables for the 395 doctoral universities in the US. The data were obtained from the Integrated Postsecondary Education Data System (IPEDS). The variables correspond to the most recent available academic year. In most cases, this means 2018-19; however, some variables are from the 2017-18 year.

Setup

knitr::opts_chunk$set(fig.align='center')
knitr::opts_chunk$set(out.width='100%')
source('utils/utils.R')
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data <- read.csv('data/college_data.csv')
ID <- "Institution.Name" 
SZ <- "Total..enrollment..DRVEF2018." 
carnegie <- "Carnegie.Classification..HD2018."
landgrant <- "Land.Grant.Institution..HDNo0Yes8."
sector <- "Sector.of.institution..HDPrivate0Public8."
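
The dotted column names above are what read.csv produces when it sanitizes headers: with the default check.names = TRUE, headers are passed through make.names, which replaces characters that are not valid in R names with dots. A minimal sketch, assuming raw headers of the form shown below (the exact header text in the IPEDS export is an assumption, not taken from the file):

```r
# Hypothetical raw headers; the actual text in data/college_data.csv may differ.
raw_headers <- c("Institution Name", "Total  enrollment (DRVEF2018)")

# make.names() replaces invalid characters (spaces, parentheses) with dots,
# which is what read.csv() applies to headers when check.names = TRUE.
clean_headers <- make.names(raw_headers)
print(clean_headers)

# Select a column by its sanitized name, using a two-row toy CSV.
toy <- read.csv(text = "Institution Name,Total  enrollment (DRVEF2018)\nA,100\nB,250")
SZ_toy <- "Total..enrollment..DRVEF2018."
print(toy[[SZ_toy]])
```

Storing the long names in short variables such as ID and SZ keeps later calls readable, and names(data) is a quick way to check the sanitized spelling of any column.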

Snapshot: Staff vs Faculty

Y <- "Total.FTE.staff..DRVHR2018."
X <- "Instructional..research.and.public.service.FTE..DRVHR2018."  

regression <- linregression_traditional(data,X,Y,carnegie)

summary_reg <- summary(regression)
rsquared <- summary_reg$r.squared
coeffs <- summary_reg$coefficients
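
Since linregression_traditional is defined in utils/utils.R and not shown here, the following is only a sketch of what the two extracted quantities look like for a plain lm fit. The data are synthetic and the variable names are made up for illustration.

```r
set.seed(1)
# Synthetic stand-ins for faculty FTE (x) and total staff FTE (y).
faculty <- runif(50, min = 500, max = 5000)
staff <- 200 + 3 * faculty + rnorm(50, sd = 800)
toy <- data.frame(faculty = faculty, staff = staff)

fit <- lm(staff ~ faculty, data = toy)
fit_summary <- summary(fit)

# r.squared is a single number between 0 and 1.
toy_rsquared <- fit_summary$r.squared

# coefficients is a matrix with one row per term and columns for the
# estimate, its standard error, the t value and the p-value.
toy_coeffs <- fit_summary$coefficients
toy_slope <- toy_coeffs["faculty", "Estimate"]

print(toy_rsquared)
print(toy_coeffs)
```

Indexing the coefficient matrix by row and column name, as in toy_coeffs["faculty", "Estimate"], is less fragile than relying on positions.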

Use ggplot2 instead: nicer plots

foo <- ggplotRegression(data,Y,X,carnegie,sector,ID,SZ,'US Doctoral Universities')
foo[[1]]

summary_reg <- summary(foo[[2]])
rsquared <- summary_reg$r.squared
coeffs <- summary_reg$coefficients

Use plotly to get an interactive plot!

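ggplotly(foo[[1]], tooltip = c('id', 'enroll', 'x', 'y'))
# webshot (with PhantomJS) lets knitr capture a static screenshot of the
# interactive widget when the document is rendered to a non-HTML format.
library(webshot)
webshot::install_phantomjs()  # one-time install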

Look for other possible correlations

X <- "Undergraduate.enrollment..DRVEF2018."  
Y <- "Graduate.enrollment..DRVEF2018."  
foo <- ggplotRegression(data,Y,X,carnegie,sector,ID,SZ,'US Doctoral Universities')
foo[[1]]

summary_reg <- summary(foo[[2]])
rsquared <- summary_reg$r.squared
coeffs <- summary_reg$coefficients

Instead, let’s look at full-time enrollment

X <- "Full.time.undergraduate.enrollment..DRVEF2018."  
Y <-  "Full.time.graduate.enrollment..DRVEF2018."  
foo <- ggplotRegression(data,Y,X,carnegie,sector,ID,SZ,'US Doctoral Universities')
foo[[1]]

summary_reg <- summary(foo[[2]])
rsquared <- summary_reg$r.squared
coeffs <- summary_reg$coefficients
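  • Considering full-time enrollment doesn’t make the trend clearer.
  • We observe an \(R^2\) value of 0.3332552, which remains low.
  • The reported slope is 0 enrolled grad students per enrolled undergrad. This is presumably a rounded value: in a simple linear fit, a slope of exactly zero would give an \(R^2\) of 0.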
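
To illustrate how a slope can print as 0 while \(R^2\) stays well above zero, here is a sketch on synthetic data (all numbers are invented and unrelated to the IPEDS file). In a simple linear fit, \(R^2\) equals the squared correlation between the two variables, so it says nothing about the size of the slope.

```r
set.seed(42)
# Synthetic enrollments: graduate counts grow slowly with undergraduate counts.
undergrad <- runif(200, min = 2000, max = 40000)
grad <- 1000 + 0.2 * undergrad + rnorm(200, sd = 3000)

fit <- lm(grad ~ undergrad)
slope <- unname(coef(fit)["undergrad"])

# Rounding to whole students hides any slope below 0.5.
print(round(slope))
print(round(slope, 2))

# For a simple linear regression, R^2 is the squared Pearson correlation.
r2 <- summary(fit)$r.squared
print(all.equal(r2, cor(undergrad, grad)^2))
```

Reporting the slope with a couple of decimals, or per 100 undergraduates, avoids the misleading zero.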

What about the number of degrees granted?

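# PhD and GradT are assumed to hold column names (PhDs granted and graduate
# enrollment); their definitions are not shown in this section.
p <- ggplotJitter(data, PhD, carnegie, sector, landgrant, ID, SZ)
ggplotly(p, tooltip = c('id', 'enroll', 'y'))

# PhDs granted per 100 enrolled graduate students
df <- data.frame(
  data[c(carnegie,PhD,ID,SZ,sector,landgrant)],
  Students.receiving.a.PhD.normalized..DRVC2018. = data[[PhD]] / data[[GradT]] * 100
)
Normalized <- "Students.receiving.a.PhD.normalized..DRVC2018."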

p <- ggplotJitter(df, Normalized, carnegie, sector, landgrant, ID, SZ)
ggplotly(p, tooltip = c('id', 'enroll', 'y'))
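
The normalization step can be tried on a toy data frame. The column names and values below are hypothetical stand-ins for the PhD and GradT columns, which are not defined in this section.

```r
# Hypothetical columns standing in for PhDs granted and graduate enrollment.
toy <- data.frame(
  name = c("A", "B", "C"),
  phds = c(50, 200, 10),
  grad_enrollment = c(1000, 8000, 100)
)
PhD_toy <- "phds"
GradT_toy <- "grad_enrollment"

# PhDs granted per 100 enrolled graduate students.
toy$phd_per_100_grad <- toy[[PhD_toy]] / toy[[GradT_toy]] * 100
print(toy)
```

In this toy example the institution with the fewest PhDs granted ends up with the highest normalized value, which is the kind of reordering the normalized plot is meant to reveal.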

Your Turn